
Add subscription/defer observability attributes and metrics#8858

Open
rohan-b99 wants to merge 19 commits into dev from
rohan-b99/subscription-observability-heartbeat-payload-errors

Conversation


@rohan-b99 rohan-b99 commented Feb 3, 2026

Adds new span attributes and metrics to improve observability of streaming responses.

Span attributes:

  • apollo.subscription.end_reason: Records the reason a subscription was terminated. Possible values are server_close, subgraph_error, heartbeat_delivery_failed, client_disconnect, schema_reload, and config_reload.
  • apollo.defer.end_reason: Records the reason a deferred query ended. Possible values are completed (all deferred chunks were delivered successfully) and client_disconnect (the client disconnected before all deferred data was delivered).

Both attributes are added dynamically to router spans only when relevant (i.e., only on requests that actually use subscriptions or @defer), rather than being present on every router span.
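As a rough illustration of how the two attributes and their documented values relate, here is a minimal sketch. `HeartbeatDeliveryFailed`, `ClientDisconnect`, `EndReason::Subscription`, and `EndReason::Defer` appear in the code under review; the other variant names and the `as_str` and `span_attribute` helpers are assumptions made for this example and may not match the router's implementation.

```rust
/// Why a subscription ended. Only `HeartbeatDeliveryFailed` and
/// `ClientDisconnect` are visible in the PR; the other variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionEndReason {
    ServerClose,
    SubgraphError,
    HeartbeatDeliveryFailed,
    ClientDisconnect,
    SchemaReload,
    ConfigReload,
}

impl SubscriptionEndReason {
    /// Attribute value as listed in the PR description.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::ServerClose => "server_close",
            Self::SubgraphError => "subgraph_error",
            Self::HeartbeatDeliveryFailed => "heartbeat_delivery_failed",
            Self::ClientDisconnect => "client_disconnect",
            Self::SchemaReload => "schema_reload",
            Self::ConfigReload => "config_reload",
        }
    }
}

/// Why a deferred query ended. `Completed` is an assumed variant name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeferEndReason {
    Completed,
    ClientDisconnect,
}

impl DeferEndReason {
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Completed => "completed",
            Self::ClientDisconnect => "client_disconnect",
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EndReason {
    Subscription(SubscriptionEndReason),
    Defer(DeferEndReason),
}

impl EndReason {
    /// Hypothetical helper: the (attribute key, attribute value) pair that
    /// would be recorded on the router span for this end reason.
    pub fn span_attribute(self) -> (&'static str, &'static str) {
        match self {
            EndReason::Subscription(r) => ("apollo.subscription.end_reason", r.as_str()),
            EndReason::Defer(r) => ("apollo.defer.end_reason", r.as_str()),
        }
    }
}
```

Because only one of the two keys is produced per request, a span for a plain query would carry neither attribute, which matches the "only when relevant" behaviour described above.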

Metrics:

The following counters are emitted when a subscription terminates:

  • apollo.router.operations.subscriptions.stream_end (attributes: subgraph.service.name): The subgraph gracefully closed the stream.
  • apollo.router.operations.subscriptions.subgraph_error (attributes: subgraph.service.name): The subscription terminated unexpectedly due to a subgraph error (e.g. process killed, network drop).
  • apollo.router.operations.subscriptions.client_disconnect (attributes: apollo.client.name): The client disconnected before the subscription ended.
  • apollo.router.operations.subscriptions.heartbeat_delivery_failed (attributes: apollo.client.name): A heartbeat could not be delivered to the client.
  • apollo.router.operations.subscriptions.schema_reload: The subscription was terminated because the router schema was updated.
  • apollo.router.operations.subscriptions.config_reload: The subscription was terminated because the router configuration was updated.

The following counter is emitted when a subscription request is rejected:

  • apollo.router.operations.subscriptions.rejected.limit: A new subscription request was rejected because the router has reached its max_opened_subscriptions limit.
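The counters above can be read as a lookup from termination reason to counter name and attribute key. The sketch below shows that lookup with a toy in-memory tally standing in for a real metrics backend. The enum variant names, `termination_counter`, and `Tally` are hypothetical, and pairing the `server_close` reason with the `stream_end` counter is an assumption based on the descriptions rather than something the PR states.

```rust
use std::collections::HashMap;

/// Assumed variant names mirroring the documented end reasons.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SubscriptionEndReason {
    ServerClose,
    SubgraphError,
    HeartbeatDeliveryFailed,
    ClientDisconnect,
    SchemaReload,
    ConfigReload,
}

/// Counter emitted when a request is rejected at the subscription limit.
pub const REJECTED_LIMIT_COUNTER: &str = "apollo.router.operations.subscriptions.rejected.limit";

/// Returns the counter name and the attribute key (if any) listed for it
/// in the PR description.
pub fn termination_counter(reason: SubscriptionEndReason) -> (&'static str, Option<&'static str>) {
    use SubscriptionEndReason::*;
    match reason {
        // Assumption: a graceful server close is counted as `stream_end`.
        ServerClose => ("apollo.router.operations.subscriptions.stream_end", Some("subgraph.service.name")),
        SubgraphError => ("apollo.router.operations.subscriptions.subgraph_error", Some("subgraph.service.name")),
        ClientDisconnect => ("apollo.router.operations.subscriptions.client_disconnect", Some("apollo.client.name")),
        HeartbeatDeliveryFailed => ("apollo.router.operations.subscriptions.heartbeat_delivery_failed", Some("apollo.client.name")),
        SchemaReload => ("apollo.router.operations.subscriptions.schema_reload", None),
        ConfigReload => ("apollo.router.operations.subscriptions.config_reload", None),
    }
}

/// Toy tally keyed by (counter name, attribute value); not a real metrics API.
#[derive(Default)]
pub struct Tally {
    counts: HashMap<(&'static str, Option<String>), u64>,
}

impl Tally {
    pub fn record(&mut self, reason: SubscriptionEndReason, attr_value: Option<&str>) {
        let (name, key) = termination_counter(reason);
        // Keep the attribute value only when the counter declares an attribute.
        let value = key.and(attr_value).map(str::to_owned);
        *self.counts.entry((name, value)).or_insert(0) += 1;
    }

    pub fn get(&self, name: &'static str, attr_value: Option<&str>) -> u64 {
        self.counts
            .get(&(name, attr_value.map(str::to_owned)))
            .copied()
            .unwrap_or(0)
    }
}
```

Keying the tally by attribute value is what allows the per-subgraph and per-client breakdowns requested later in the review.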

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added [3] and documented
  • Tests added and passing [4]
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.


apollo-librarian bot commented Feb 3, 2026

✅ Docs preview ready

The preview is ready to be viewed.

File Changes

0 new, 2 changed, 0 removed
* graphos/routing/(latest)/observability/router-telemetry-otel/enabling-telemetry/standard-instruments.mdx
* graphos/routing/(latest)/operations/subscriptions/configuration.mdx

Build ID: adf329da55809814b64cc9df
Build Logs: View logs

URL: https://www.apollographql.com/docs/deploy-preview/adf329da55809814b64cc9df


@rohan-b99 rohan-b99 changed the title from "Rohan b99/subscription observability heartbeat payload errors" to "Add apollo.subscription.end_reason attribute to subscription spans" on Feb 3, 2026
@rohan-b99 rohan-b99 force-pushed the rohan-b99/subscription-observability-heartbeat-payload-errors branch from a5c8dad to b9ce87b on February 4, 2026 15:21

@carodewig carodewig left a comment


This is looking great! One comment/question below


@carodewig carodewig left a comment


I've pushed a commit which does this (although the reason is now stored in an enum in the Multipart struct). Let me know if that's better or if you had something else in mind.

I think it's easier to understand now, but happy to be overruled!

response.has_next.unwrap_or(false) || response.subscribed.unwrap_or(false);

// Check for reload-related termination before errors are moved
let end_reason = Self::detect_reload_end_reason(&response.errors);
Contributor

Maybe I'm missing something - but is this variable ever used / stored?

Contributor Author

It was used further down, but only for subscriptions whose connection was no longer open. I did a small refactor to make it optional, since at this point we don't know for sure whether the connection has closed. The response gets moved a bit lower down, though, so I think the check does need to happen here.

@rohan-b99 rohan-b99 marked this pull request as ready for review February 12, 2026 14:34
@rohan-b99 rohan-b99 requested review from a team as code owners February 12, 2026 14:34
@rohan-b99

Hi @theJC, just tagging you here to see if you have any feedback on this implementation. I understand the initial request for more observability on heartbeats and disconnections for subscription/@defer requests came from you a while ago, so I'm looking to see whether these two new attributes align with expectations.


@pragl pragl left a comment


I've left some suggestions on the docs changes to align with our style guide. Please consider them when you get the chance. Thanks!

rohan-b99 and others added 2 commits February 12, 2026 17:24
/// Used to detect if a heartbeat was the last thing sent before connection closed.
heartbeat_pending: bool,
/// The span captured at creation time, used to record attributes on connection close.
span: Span,

@theJC theJC Feb 14, 2026


recommend naming this creation_span rather than span

Comment on lines 185 to 200
let end_reason = self.end_reason.take().unwrap_or_else(|| match self.mode {
    ProtocolMode::Subscription => {
        // Stream wasn't terminated properly - determine the reason
        let reason = if self.heartbeat_pending {
            // Heartbeat was the last thing sent - likely failed to deliver
            SubscriptionEndReason::HeartbeatDeliveryFailed
        } else {
            // Connection closed after a message was sent
            SubscriptionEndReason::ClientDisconnect
        };
        EndReason::Subscription(reason)
    }
    ProtocolMode::Defer => {
        // Defer stream wasn't terminated properly - client disconnected
        EndReason::Defer(DeferEndReason::ClientDisconnect)
    }
Contributor

Would it be cleaner if this complex logic were extracted into a function infer_abnormal_termination_reason in Multipart?
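One way the suggested extraction might look is sketched below. The types are pared-down stand-ins containing only the fields and variants visible in the snippet under review, and `take_end_reason` is a hypothetical wrapper name; the real `Multipart` struct has more fields (such as the span) and may be organised differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProtocolMode {
    Subscription,
    Defer,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SubscriptionEndReason {
    HeartbeatDeliveryFailed,
    ClientDisconnect,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeferEndReason {
    ClientDisconnect,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EndReason {
    Subscription(SubscriptionEndReason),
    Defer(DeferEndReason),
}

/// Stand-in for the real struct; only the fields used by the inference logic.
struct Multipart {
    mode: ProtocolMode,
    heartbeat_pending: bool,
    end_reason: Option<EndReason>,
}

impl Multipart {
    /// Guesses why the stream ended when no explicit reason was recorded.
    fn infer_abnormal_termination_reason(&self) -> EndReason {
        match self.mode {
            // A heartbeat was the last thing sent, so it likely failed to deliver.
            ProtocolMode::Subscription if self.heartbeat_pending => {
                EndReason::Subscription(SubscriptionEndReason::HeartbeatDeliveryFailed)
            }
            // The connection closed after a regular message was sent.
            ProtocolMode::Subscription => {
                EndReason::Subscription(SubscriptionEndReason::ClientDisconnect)
            }
            // A defer stream that was not terminated properly means the client left.
            ProtocolMode::Defer => EndReason::Defer(DeferEndReason::ClientDisconnect),
        }
    }

    /// Hypothetical caller: prefers an explicitly recorded reason, otherwise infers one.
    fn take_end_reason(&mut self) -> EndReason {
        match self.end_reason.take() {
            Some(reason) => reason,
            None => self.infer_abnormal_termination_reason(),
        }
    }
}
```

Pulling the inference into its own method keeps the call site to a single expression and lets the fallback rules be unit-tested without constructing a live stream.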


theJC commented Feb 14, 2026

Awesome, this looks to be a nice improvement to TRACE observability into subscription/defer activity!

Another much-needed piece of observability is to have instruments/metrics emitted for the important unhappy-path conditions, i.e.:

  • count of subscriptions that failed due to hitting the max_opened_subscriptions limit
  • count of stream_end / subgraph closing unexpectedly (by subgraph name)
  • count of client_disconnect (by client name)
  • count of heartbeat_delivery_failed (by client name)

It's important to be able to track how often these occur in order to monitor the overall health of subscription activity in the Router, so that anomalies in these trends can be detected and acted upon.


@mabuyo mabuyo left a comment


@pragl's comments were resolved, so approving on behalf of Docs!

@rohan-b99 rohan-b99 changed the title from "Add apollo.subscription.end_reason attribute to subscription spans" to "Add subscription/defer observability attributes and metrics" on Feb 19, 2026
@rohan-b99

Awesome, this looks to be a nice improvement to TRACE observability into subscription/defer activity!

Another much-needed piece of observability is to have instruments/metrics emitted for the important unhappy-path conditions, i.e.:

  • count of subscriptions that failed due to hitting the max_opened_subscriptions limit
  • count of stream_end / subgraph closing unexpectedly (by subgraph name)
  • count of client_disconnect (by client name)
  • count of heartbeat_delivery_failed (by client name)

It's important to be able to track how often these occur in order to monitor the overall health of subscription activity in the Router, so that anomalies in these trends can be detected and acted upon.

Thanks for your feedback on this! I appreciate the input around metrics. I've implemented these, along with some other refactors to make termination reasons clearer.


5 participants
